What is Corpus Linguistics?

Author

  • Stefan Th. Gries
Abstract

Corpus linguistics is one of the fastest-growing methodologies in contemporary linguistics. In a conversational format, this article answers a few questions that corpus linguists regularly face from linguists who have not used corpus-based methods so far. It discusses some of the central assumptions ('formal distributional differences reflect functional differences'), notions (corpora, representativity and balancedness, markup and annotation), and methods of corpus linguistics (frequency lists, concordances, collocations), and discusses a few ways in which the discipline still needs to mature.

At a recent LSA meeting ... [with an obvious bow to Frederick Newmeyer]

Question: So, I hear you're a corpus linguist. Interesting, I get to see more and more abstracts and papers and even job ads where experience with corpus-based methods is mentioned, but I actually know only very little about this area. So, what's this all about?

Answer: Yes, it's true, it's really an approach that's gaining more and more prominence in the field. In an editorial of the flagship journal of the discipline, Joseph (2004:382) actually wrote 'we seem to be witnessing as well a shift in the way some linguists find and utilize data – many papers now use corpora as their primary data, and many use internet data'.

Question: My impression exactly. Now, you say 'approach', but that's something I've never really understood. Corpus linguistics – is that a theory or model or a method or what?

Answer: Good question and, as usual, people differ in their opinions. One well-known corpus linguist, for example, considers corpus linguistics – he calls it computer corpus linguistics – a 'new philosophical approach [...]' (Leech 1992:106). Many others, including myself, consider it a method(ology), no more, but also no less (cf. McEnery et al. 2006:7f.). However, I don't think this difference would result in many practical differences. Taylor (2008) discusses this issue in more detail, and for an amazingly comprehensive overview of how huge and diverse the field has become, cf. Lüdeling and Kytö (2008, 2009).

Question: Hm ... But if you think corpus linguistics is a methodology ... Well, let me ask you this: usually, linguists try to interpret the data they investigate against the background of some theory. Generative grammarians interpret their acceptability judgments within Government and Binding Theory or the Minimalist Program; some psycholinguists interpret their reaction time data within, for example, a connectionist interactive activation model – now if corpus linguistics is only a methodology, then what is the theory within which you interpret your findings?

Answer: Again as usual, there's no simple answer to this question; it depends ... There are different perspectives one can take. One is that many corpus linguists would perhaps even say that for them, linguistic theory is not of the same prime importance as it is in, for example, generative approaches. Correspondingly, I think it's fair to say that a large body of corpus-linguistic work has a rather descriptive or applied focus and does not actually involve much linguistic theory. Another one is that corpus-linguistic methods are a method just as acceptability judgments, experimental data, etc. are, and that linguists of every theoretical persuasion can use corpus data.
If a linguist investigates how lexical items become more and more used as grammatical markers in a corpus, then the results are descriptive and/or most likely interpreted within some form of grammaticalization theory. If a linguist studies how German second language learners of English acquire the formation of complex clauses, then he will either just describe what he finds or interpret it within some theory of second language acquisition, and so on.

There's one other, more general way to look at it, though. I can of course not speak for all corpus linguists, but I myself think that a particular kind of linguistic theory is actually particularly compatible with corpus-linguistic methods. These are usage-based cognitive-linguistic theories, and they're compatible with corpus linguistics in several ways. (You'll find some discussion in Schönefeld 1999.) First, the units of language assumed in cognitive linguistics and corpus linguistics are very similar: what counts as a unit in probably most versions of cognitive linguistics or construction grammar is a symbolic unit or a construction, which is an element that covers morphemes, words, etc. Such symbolic units or constructions are often defined broadly enough to match nearly all of the relevant corpus-linguistic notions (cf. Gries 2008a): collocations, colligations, phraseologisms, ... Lastly, corpus-linguistic analyses are always based on the evaluation of some kind of frequencies, and frequency, as well as its supposed mental correlate of cognitive entrenchment, is one of the central explanatory mechanisms within cognitively motivated approaches (cf., e.g. Bybee and Hopper 1997; Barlow and Kemmer 2000; Ellis 2002a,b; Goldberg 2006).

Question: Wait a second – 'corpus-linguistic analyses are always based on the evaluation of some kind of frequencies'? What does that mean? I mean, most linguistic research I know is not about frequencies at all – if corpus linguistics is all about frequencies, then what does corpus linguistics have to contribute?

Answer: Well, many corpus linguists would probably not immediately agree with my statement, but I think it's true anyway. There are two things to be clarified here. First, frequency of what? The answer is, there are no meanings, no functions, no concepts in corpora – corpora are (usually text) files, and all you can get out of such files is distributional (or quantitative/statistical) information:

  • frequencies of occurrence of linguistic elements, i.e. how often morphemes, words, grammatical patterns, etc. occur in (parts of) a corpus; this information is usually represented in so-called frequency lists;

  • frequencies of co-occurrence of these elements, i.e. how often morphemes occur with particular words, how often particular words occur in a certain grammatical construction, etc.; this information is mostly shown in so-called concordances, in which all occurrences of, say, the word searched for are shown in their respective contexts; Figure 1 (a concordance output from AntConc 3.2.2w) is an example.

As a linguist, you don't just want to talk about frequencies or distributional information, which is why corpus linguists must make a particular fundamental assumption, or conceptual leap, from frequencies to the things linguists are actually interested in – but frequencies are where it all starts.
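Both kinds of information are straightforward to extract computationally. The following is a minimal sketch, assuming (purely for illustration) a plain-text corpus file called corpus.txt and a deliberately crude tokenizer, of how a frequency list and a simple keyword-in-context concordance can be produced:

    # Minimal sketch: a frequency list and a simple KWIC concordance from a
    # plain-text file. The file name 'corpus.txt' and the crude regex
    # tokenization are assumptions for illustration only.
    import re
    from collections import Counter

    with open("corpus.txt", encoding="utf-8") as f:
        tokens = re.findall(r"[a-z']+", f.read().lower())

    # Frequency list: how often each word form occurs in the corpus
    freq_list = Counter(tokens)
    print(freq_list.most_common(10))

    # Concordance: every occurrence of a search word in its immediate context
    def concordance(tokens, word, window=5):
        for i, token in enumerate(tokens):
            if token == word:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                print(f"{left:>40}  [{token}]  {right}")

    concordance(tokens, "give")

Dedicated concordancers such as AntConc provide the same kind of functionality behind a graphical interface.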
Second, what kind of frequency? The answer is that the notion of frequency doesn't presuppose that the relevant linguistic phenomenon occurs in a corpus 100 or 1,000 times – the notion of frequency also includes phenomena that occur only once or not at all. For example, there are statistical methods and models out there that can handle non-occurrence or estimate frequencies of unseen items. Thus, corpus linguistics is concerned with whether

  • something (an individual element or the co-occurrence of more than one individual element) is attested in corpora, i.e. whether the observed frequency (of occurrence or co-occurrence) is 0 or larger;

  • something is attested in corpora more often than something else, i.e. whether an observed frequency is larger than the observed frequency of something else;

  • something is observed more or less often than you would expect by chance [this is a more profound issue than it may seem at first; Stefanowitsch (2006) discusses this in more detail].

This also implies that statistical methods can play a large part in corpus linguistics, but this is one area where I think the discipline must still mature or evolve.

Question: What do you mean?

Answer: Well, this is certainly a matter of debate, but I think that a field that developed in part out of a dissatisfaction concerning methods and data in linguistics ought to be very careful as far as its own methods and data are concerned. It is probably fair to say that many linguists turned to corpus data because they felt there must be more to data collection than researchers intuiting acceptability judgments about what one can say and what one cannot; cf. Labov (1975) and, say, Wasow and Arnold (2005:1485) for discussion and exemplification of the mismatch between the reliability of judgment data by prominent linguists of that time and the importance that was placed on them, as well as McEnery and Wilson (2001: Ch. 1), Sampson (2001: Chs 2, 8, and 10), and the special issue of Corpus Linguistics and Linguistic Theory (CLLT) 5.1 (2008) on corpus-linguistic positions regarding many of Chomsky's claims in general and the method of acceptability judgments in particular. However, since corpus data only provide distributional information in the sense mentioned earlier, this also means that corpus data must be evaluated with tools that have been designed to deal with distributional information, and the discipline that provides such tools is statistics. And this is actually completely natural: psychologists and psycholinguists undergo comprehensive training in experimental methods and the statistical tools relevant to these methods, so it's only fair that corpus linguists do the same in their domain. After all, it would be kind of a double standard to bash many theoretical linguists for their presumably faulty introspective judgment data on the one hand, but then only introspectively eyeball distributions and frequencies on the other. For a while, however, this is exactly what happened, but the picture is now changing a lot – cf. Mukherjee (2007) for discussion – and more and more corpus linguists use increasingly sophisticated statistical and/or computational methods (cf. papers in Bod et al. 2003).
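To make the third kind of question concrete, comparing an observed frequency with the frequency expected by chance, here is a minimal sketch of one simple association measure for two-word sequences; the file name, the crude tokenization, and the choice of pointwise mutual information are assumptions made for illustration rather than a recommendation of any particular measure:

    # Minimal sketch: is a two-word sequence observed more often than chance
    # would lead us to expect? File name and tokenization are assumptions.
    import math
    import re
    from collections import Counter

    with open("corpus.txt", encoding="utf-8") as f:
        tokens = re.findall(r"[a-z']+", f.read().lower())

    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)

    def association(w1, w2):
        observed = bigrams[(w1, w2)]
        # expected co-occurrence frequency if w1 and w2 combined purely by chance
        expected = unigrams[w1] * unigrams[w2] / n
        pmi = math.log2(observed / expected) if observed else float("-inf")
        return observed, expected, pmi

    print(association("set", "in"))

In actual collocation research, the same observed and expected counts are typically fed into more robust measures such as the log-likelihood ratio or Fisher's exact test.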
As a result, not only do corpus-linguistic studies become more comprehensive and more precise, the larger degree of quantification also does more justice to language as a multifactorial object of study and makes it easier to relate corpus-based findings to experimental findings. There are now many studies, not only in corpus linguistics in particular but also in linguistics in general, that combine evidence from these two kinds of methods (cf. Kepser and Reis 2005 and CLLT 5.1 again).

Question: Ah, ok, well, that makes sense. But even then ... on the one hand, it sounds like corpus linguistics should be widely applicable to linguistic problems, but on the other hand, this focus on frequencies and quantification makes it sound as if the applicability of corpus data or the relevance of corpus-linguistic approaches may be severely limited. Or, what is this 'assumption' or 'conceptual leap' that allows you to do more than just distributional number-crunching?

Answer: Oh, on the contrary, corpus data are very widely applicable! Although it's not always openly stated, the assumption underlying most corpus-based analyses is that formal differences reflect, or correspond to, functional differences. Thus, different frequencies of (co-)occurrences of formal elements – again, morphemes, words, syntactic patterns, etc. – are assumed to reflect functional regularities, and 'functional' is understood here in a very broad sense as anything – be it semantic, discourse-pragmatic, ... – that is intended to perform a particular communicative function.

Question: I am not sure I get this: how do frequencies reflect functional regularities? Uh ...

Answer: Well, one could go back to Bolinger's (1968:127) famous dictum that 'a difference in syntactic form always spells a difference in meaning', the principle of no synonymy, Harris's (1970:786) 'difference of meaning correlates with difference of distribution', or Goldberg's (1995:67) more recent version 'if two constructions are syntactically distinct, they must be semantically or pragmatically distinct'. The above assumption is just a little more general than these, including any kind of formal distributional difference and any kind of functional aspect. Consider as an example the case of argument structure, or transitivity alternations such as the 'alternation' between John sent Mary the book and John sent the book to Mary. Rather than assuming that both syntactic patterns – the ditransitive NP_AGT V NP_REC NP_PAT vs. the prepositional to-dative NP_AGT V NP_PAT PP_to-REC – are functionally identical, another perspective (which is not unique to corpus linguists) might be to assume that, following the principle of no synonymy, the difference in the formal patterns will reflect some functional difference and that corpus-based frequency data can be used to identify that difference. Non-corpus-based studies have suggested, among other things having to do with information structure, that the ditransitive pattern is closely associated with transfer of the patient from the agent to the recipient across a small distance, whereas the prepositional to-dative pattern is more associated with transfer over some distance.
Now, a corpus-based study by Gries and Stefanowitsch (2004) looked at the statistically preferred verbs in the verb slots of the two patterns and found that the ditransitive's two most strongly preferred verbs are give and tell, which prototypically involve close proximity of the agent and the recipient, whereas the prepositional dative's two most strongly preferred verbs are bring and play (as in he played the ball to him), which prototypically involve larger distances: if I stand next to you and hand you something, you don't use bring, right? These findings are therefore congruent with the proposed semantic difference (a small sketch of the kind of frequency comparison involved follows after the list below). Using the same kind of operationalization, i.e. the paraphrase of linguistic hypotheses into frequencies, you can investigate an extremely large number of issues. Here's a (necessarily short and biased) list of examples:

  • first language acquisition: how often does a child get to hear particular words or patterns in the input, and how does that affect the ease/speed with which a child acquires these words and patterns (cf. Goodman et al. 2008 on lexical acquisition, Tomasello 2003 on the acquisition of patterns, and Behrens 2008 for a general recent overview)?

  • second/foreign language acquisition: how do we assess second language learners' lexical proficiency (cf. Laufer and Nation 1995; Meara 2005)? how do we determine what linguistic elements to focus on in instruction, and how do we use corpora in language teaching (cf. Stevens 1991; Fox 1998; Conrad 2000; Römer 2006)? (For a recent overview of corpus-based methods in SLA, cf. Gries 2008b.)

  • language and culture: to what degree do frequencies of words reflect differences between cultures (cf. Hofland and Johansson 1982; Leech and Fallon 1992; Oakes and Farrow 2006)?

  • historical developments: how can corpora inform language genealogy (cf. Lüdeling 2006)? how do corpus-based methods contribute to research in grammaticalization (cf. Lindquist and Mair 2004 or Hoffmann 2005)?

  • phonology: how well can the degree of phonological assimilation or reduction of an expression be predicted on the basis of its components' frequency of co-occurrence (cf. Bybee and Scheibman 1999; Gahl and Garnsey 2004; Ernestus et al. 2006)?

  • morphology: what do regular and irregular verb forms reveal about the probabilistic nature of the linguistic system (cf. Baayen and Martín 2005)? how do we assess the productivity of morphological processes (cf. Baayen and Renouf 1996; Plag 1999)?

  • syntax: how can we predict which syntactic choices speakers will make (cf. Leech et al. 1994; Wasow 2002; Gries 2003a,b; Bresnan et al. 2007)? what are the overall frequencies of English grammatical structures (cf. Roland et al. 2007)?

  • semantics and pragmatics: how do near synonyms differ from each other (cf. Okada 1999; Oh 2000; Gast 2006; Gries and David 2006; Arppe and Järvikivi 2007; Divjak forthcoming)? how are antonyms acquired (cf. Jones and Murphy 2005)? how can we approach complex multifactorial notions such as idiomaticity and compositionality (cf. Barkema 1993; Langlotz 2006; Wulff 2008)? how come some words such as happen and set in have a negative twang to them (cf. Whitsitt 2005; Dilts and Newman 2006; Bednarek 2008)?

  • plus applications in psycholinguistics (e.g. syntactic priming/persistence), stylistics, sociolinguistics, forensic linguistics (e.g. authorship attribution), etc.
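The verb-slot preferences just mentioned rest on exactly this logic of comparing observed with expected frequencies. As a rough illustration only, the sketch below tests whether a verb is attracted to the ditransitive rather than the to-dative using a 2-by-2 table and Fisher's exact test; all counts are invented for the example and are not the figures from Gries and Stefanowitsch (2004):

    # Rough sketch with invented counts: is 'give' attracted to the
    # ditransitive rather than the to-dative? None of these numbers are real.
    from scipy.stats import fisher_exact

    give_ditransitive = 460      # hypothetical count of 'give' in the ditransitive
    give_todative = 145          # hypothetical count of 'give' in the to-dative
    other_ditransitive = 1040    # hypothetical count of all other verbs, ditransitive
    other_todative = 1355        # hypothetical count of all other verbs, to-dative

    table = [[give_ditransitive, give_todative],
             [other_ditransitive, other_todative]]

    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4g}")

Repeating such a test for every verb and ranking the verbs by the resulting p-values or effect sizes is, in essence, the logic behind their distinctive collexeme analysis.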
Note also that frequency information per se is often a crucial factor to control for in psycholinguistic experiments, as frequencies of occurrence are correlated with, among other things, reaction times.

Question: Wow, ok, that's certainly more diverse than I would've thought. From what I had seen, I thought most corpus-based work is purely descriptive and maybe lexicographic or applied in nature. Also, I thought that many people would now use the World Wide Web as a corpus – I now often read something like 'a web search revealed that X is more frequent than Y ...', and Joseph's comment you mentioned earlier suggests the same – but then many of the applications you mention don't sound as if that can be true, or can it?

Answer: Sigh ... well, yes and no. Yes, it's true that a growing number of linguists now often query a search engine to, say, determine frequencies of words or patterns. Unfortunately, this practice can result in quite a few problems, some of which are technical in nature while others are more theoretical. As for the technical problems, it's well known by now that the frequencies returned by Google, Yahoo, and other search engines are very variable and may, thus, be unreliable, and web data come with a variety of other problems, too; the special issue 29.3 (2003) of Computational Linguistics, Hundt et al. (2007), and Eu (2008) would be good places to read on this.

Question: Yes, I actually heard about the technical problems, but what are the theoretical problems? And then I still want to know what exactly a corpus is! Is that just any collection of '(text) files'? And how do you access it? I guess nowadays this is all done computationally?

Answer: Feel free to go to London and manually browse the index cards of the Survey of English Usage (http://www.ucl.ac.uk/english-usage/), one of the earliest corpora. Joking aside, yes, nowadays corpus-linguistic studies are nearly always done computationally, as virtually all corpora are text collections stored in the form of plain ASCII or, increasingly commonly, Unicode text files that can be loaded, manipulated, and processed platform-independently. This doesn't mean, however, that corpus linguists only deal with raw text files; on the contrary, some corpora come with linked audio files or are shipped with sophisticated retrieval software that makes it possible to retrieve immediately the position of an utterance in an audio file or to look for precisely defined syntactic and/or lexical patterns, etc. It does mean, however, that you would have a hard time finding corpora on paper, in the form of punch cards, or digitally in proprietary binary formats (such as Microsoft Word's DOC format); the current standard is probably text files with XML annotation (cf. McEnery et al. 2006:28, 35, and passim as well as the website of the Text Encoding Initiative at http://www.tei-c.org/Guidelines/P4/html/SG.html).
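To give a flavor of what working with annotated corpus files can look like, here is a minimal sketch that reads a small XML sample with Python's standard library; the <w pos="..."> markup is a simplified stand-in invented for this example, not actual TEI markup or the format of any particular corpus:

    # Minimal sketch: extracting word tokens and part-of-speech tags from a
    # small XML-annotated sample. The <w pos="..."> scheme is a simplified,
    # made-up stand-in for real annotation standards such as TEI.
    import xml.etree.ElementTree as ET

    sample = """
    <s>
      <w pos="PRON">She</w>
      <w pos="VERB">gave</w>
      <w pos="PRON">him</w>
      <w pos="DET">the</w>
      <w pos="NOUN">book</w>
    </s>
    """

    root = ET.fromstring(sample)
    tagged = [(w.text, w.get("pos")) for w in root.iter("w")]
    print(tagged)   # [('She', 'PRON'), ('gave', 'VERB'), ...]

    # With annotation like this, one can count not just word forms but,
    # for example, all verbs, or all nouns directly following a verb.
    print([word for word, pos in tagged if pos == "VERB"])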
But let me come back to the question about the theoretical problems, which will actually also lead to what at least I think a corpus is. These problems are concerned with a general problem of scientific inquiry. When you study any phenomenon, let's assume a linguistic phenomenon such as the frequencies of words or syntactic constructions, or the productivity of a particular word-formation process, or whatever, you usually want to be able to generalize from the data you investigate to, in our case, the language as a whole or a particular register or genre. Now the issue is, can one generalize from the web, or what can one generalize to?

Question: Sure, the web contains all sorts of information ...

Answer: ... actually, it's not quite that certain. Yes, the web contains all sorts of information, especially now in the web 2.0 era with community forums, personal webspaces, blogs, etc. But think about it: the question is, what does the web not contain, what does it contain, and how many of the different things there are does it contain? For example, what are contents that you won't find on the web? Yes, people post very many private things on their sites, but, for instance, personal diaries, intimate conversations, love letters or other private letters, confessions by criminals, jury transcripts, the conversation between a driving instructor and his student, or the language exchanged during a gang brawl may be important for some linguistic studies but are not found on the web (at least not often). True, many of these you will also not find in corpora, but this just goes to show that the web does not contain everything. Also, think about contents you do find on the web a lot: marketing and advertising, computer and tech language – just google Java or Ruby and note how often the island or the gem show up in the first 50 hits – journalese, pornography, scientific texts, etc. It's not at all obvious how well you can generalize from the web.

Question: Ok, ok, I get it. But then how are corpora that do not only contain texts from the web better?

Answer: Now, many corpora have been put together with an eye to taking care of these issues, to make it possible to generalize from a corpus to a language as a whole or at least to a particular variety, register, etc. Thus, corpus compilers usually try to make their corpora representative and balanced. By representative, I mean that the different parts of the linguistic variety I'm interested in are all manifested in the corpus. For example, if I was interested in phonological reduction patterns in the speech of Californian adolescents and recorded only parts of their conversations with several people from their peer group, my corpus would not be representative in the above sense because it would not reflect the fact that some sizable proportion of the speech of Californian adolescents may also consist of dialogs with a parent, a teacher, etc., which would optimally also have to be included. By balanced, I mean that ideally not only should all parts of which a variety consists be sampled into the corpus, but also that the proportion with which a particular part is represented in a corpus should reflect the proportion the part makes up in this variety and/or the importance of the part in this variety. For example, if I know that dialogs make up 65% of the speech of Californian adolescents, approx. 65% of my corpus should consist of dialog recordings.

Question: But that's ridiculous: we don't know these percentages!

Answer: Sadly enough, that's true – we don't.
These criteria are admittedly more of a theoretical ideal. Even if resources were no problem, how would we measure the proportion that dialogs make up of the speech of Californian adolescents? We can only record a tiny sample of all Californian adolescents, and how would we measure the proportion of dialogs – in terms of time? in terms of sentences? in terms of words? And if we tried to compile a corpus representative of a language as a whole, then how would we measure the importance of a particular linguistic variety?

Question: Well, conversational speech is primary, but I wouldn't know what the next most important variety is.

Answer: I suppose that's the most widely held belief, and corpus compilers often aim at including as much spoken language as possible because, as you say yourself, spoken language, especially spoken conversation, is typically considered the most basic form of language use. On the other hand, a single catchy and thus salient newspaper headline read by millions of people may have a much larger influence on every reader's linguistic system and on the language 'as a whole' than twenty hours of dialog as usual. Anyway, representative and balanced corpora are a theoretical ideal corpus compilers constantly bear in mind, but the ultimate and exact way of compiling a truly representative and balanced corpus has eluded us so far. If you want to read on this, Biber's (1990, 1993) work is most instructive, and McEnery et al. (2006: Sections A2, A8, and B1) provide good summary sections.
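The bookkeeping side of balancedness is the easy part; the hard part, as just discussed, is knowing what the target proportions should be in the first place. Purely as an illustration, the following sketch compares the word-count proportions of a corpus's parts against assumed target proportions, with all categories, counts, and targets invented:

    # Purely illustrative sketch: compare the composition of a corpus against
    # target proportions. The parts, word counts, and targets are invented.
    word_counts = {"dialog": 650_000, "monolog": 150_000, "writing": 200_000}
    targets = {"dialog": 0.65, "monolog": 0.15, "writing": 0.20}

    total = sum(word_counts.values())
    for part, count in word_counts.items():
        share = count / total
        print(f"{part:10s} {share:6.1%}  (target {targets[part]:.0%})")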




Journal:
  • Language and Linguistics Compass

Volume 3

Publication date 2009